Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods
Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam
and RMSProp, have been widely used in training deep neural networks. We
empirically show that while the difference between the standard generalization
performance of models trained using these methods is small, those trained using
SGD exhibit far greater robustness under input perturbations. Notably, our
investigation demonstrates the presence of irrelevant frequencies in natural
datasets: altering these frequencies does not affect models' generalization performance.
However, models trained with adaptive methods show sensitivity to these
changes, suggesting that their use of irrelevant frequencies can lead to
solutions sensitive to perturbations. To better understand this difference, we
study the learning dynamics of gradient descent (GD) and sign gradient descent
(signGD) on a synthetic dataset that mirrors natural signals. With a
three-dimensional input space, the models optimized with GD and signGD have
standard risks close to zero but vary in their adversarial risks. Our result
shows that linear models' robustness to ℓ2-norm bounded changes is
inversely proportional to the model parameters' weight norm: a smaller weight
norm implies better robustness. In the context of deep learning, our
experiments show that SGD-trained neural networks have smaller Lipschitz
constants, which explains their better robustness to input perturbations
compared with networks trained using adaptive gradient methods.
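
A minimal sketch of the GD vs. signGD comparison described above, on a synthetic linearly separable dataset; the data model, dimensions, step size, and loss below are illustrative assumptions rather than the paper's setup. It shows how the two optimizers can reach similar standard accuracy while ending up with very different weight norms.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's exact experiment):
# train a linear classifier on a synthetic 3-D dataset with gradient descent (GD)
# and sign gradient descent (signGD), then compare standard accuracy and weight
# norms, since the abstract relates robustness to the weight norm of the model.
rng = np.random.default_rng(0)
n, d = 1000, 3
# Assumed data model: labels depend only on the first coordinate; the other two
# coordinates play the role of "irrelevant" directions in the input.
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])

def train(update, lr=0.01, steps=2000):
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the logistic loss, averaged over the dataset
        grad = -(X * (y * (1 / (1 + np.exp(margins))))[:, None]).mean(axis=0)
        w -= lr * update(grad)
    return w

w_gd = train(lambda g: g)              # gradient descent
w_sign = train(lambda g: np.sign(g))   # sign gradient descent

for name, w in [("GD", w_gd), ("signGD", w_sign)]:
    acc = (np.sign(X @ w) == y).mean()
    print(f"{name}: standard accuracy={acc:.3f}, ||w||_2={np.linalg.norm(w):.3f}")
```

In this toy setting signGD typically ends with a much larger weight norm than GD while matching it on standard accuracy, which is the kind of gap the abstract connects to differing adversarial risk.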
Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
We address the problem of automatic generation of features for value function
approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve
the error of policy evaluation with function approximation, with a convergence
rate similar to that of value iteration. We propose a simple, fast and robust
algorithm based on random projections to generate BEBFs for sparse feature
spaces. We provide a finite sample analysis of the proposed method, and prove
that projections logarithmic in the dimension of the original space are enough
to guarantee contraction in the error. Empirical results demonstrate the
strength of this method.
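
A rough sketch of the idea of generating one Bellman Error Basis Function from a random projection of sparse features; the dimensions, projection size, regression step, and synthetic data are assumptions for illustration, not the paper's exact algorithm or its finite-sample analysis.

```python
import numpy as np

# Rough sketch (illustrative assumptions): build one Bellman Error Basis
# Function (BEBF) by (1) randomly projecting sparse, high-dimensional features
# to a low dimension and (2) regressing the Bellman (TD) error of the current
# value estimate onto the projected features.
rng = np.random.default_rng(0)
n, D, k, gamma = 200, 5000, 50, 0.95   # samples, sparse dim, projected dim, discount

phi = (rng.random((n, D)) < 0.001).astype(float)       # sparse features of states s
phi_next = (rng.random((n, D)) < 0.001).astype(float)  # sparse features of next states s'
r = rng.normal(size=n)                                  # observed rewards

def new_bebf(weights):
    """Approximate the Bellman error of V(s) = phi(s) @ weights as a new feature."""
    bellman_err = r + gamma * (phi_next @ weights) - phi @ weights
    P = rng.normal(size=(D, k)) / np.sqrt(k)     # random projection, k much smaller than D
    beta, *_ = np.linalg.lstsq(phi @ P, bellman_err, rcond=None)
    return lambda features: (features @ P) @ beta

bebf = new_bebf(np.zeros(D))   # e.g., starting from the value estimate V = 0
print("new feature on first 5 states:", bebf(phi[:5]))
```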
Efficient and Accurate Optimal Transport with Mirror Descent and Conjugate Gradients
We design a novel algorithm for optimal transport by drawing from the
entropic optimal transport, mirror descent and conjugate gradients literatures.
Our scalable and GPU parallelizable algorithm is able to compute the
Wasserstein distance with extreme precision, reaching relative error rates of
10^-8 without numerical stability issues. Empirically, the algorithm
converges to high precision solutions more quickly in terms of wall-clock time
than a variety of algorithms, including the log-domain stabilized Sinkhorn
algorithm. We provide careful ablations with respect to algorithm and problem
parameters, and present benchmarking over upsampled MNIST images, comparing to
various recent algorithms over high-dimensional problems. The results suggest
that our algorithm can be a useful addition to the practitioner's optimal
transport toolkit.
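
For reference, a minimal sketch of the log-domain stabilized Sinkhorn baseline mentioned in the abstract, not the proposed mirror-descent/conjugate-gradient algorithm; the problem sizes, cost matrix, and regularization value are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

# Minimal sketch of the log-domain stabilized Sinkhorn baseline (not the
# proposed algorithm): entropic optimal transport between discrete
# distributions a and b with cost matrix C and regularization eps.
def sinkhorn_log(a, b, C, eps=0.01, iters=2000):
    f, g = np.zeros_like(a), np.zeros_like(b)
    logK = -C / eps
    for _ in range(iters):
        # log-sum-exp updates keep the iteration stable even for small eps
        f = eps * (np.log(a) - logsumexp(logK + g[None, :] / eps, axis=1))
        g = eps * (np.log(b) - logsumexp(logK + f[:, None] / eps, axis=0))
    P = np.exp(logK + f[:, None] / eps + g[None, :] / eps)   # transport plan
    return (P * C).sum(), P                                  # entropic OT cost, plan

rng = np.random.default_rng(0)
n = 64
x, y = np.sort(rng.random(n)), np.sort(rng.random(n))
C = (x[:, None] - y[None, :]) ** 2     # squared-distance ground cost (an assumption)
a = np.full(n, 1.0 / n)
b = np.full(n, 1.0 / n)
cost, plan = sinkhorn_log(a, b, C)
print("entropic OT cost:", cost)
```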